In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
import dateutil.parser
First, I made a mistake naming the data set! It's 2015 data, not 2014 data. But yes, still use
311-2014.csv
. You can rename it.
Import your data, but only the first 200,000 rows. You'll also want to change the index to be a datetime based on the Created Date column - you'll want to check if it's already a datetime, and parse it if not.
In [2]:
df=pd.read_csv("311-2015.csv",nrows=200000) #Importing the rows.
df.head(5) #Seems to work.
Out[2]:
In [3]:
def parse_date(str_date): #Using a function to convert the dates into datetime format.
return dateutil.parser.parse(str_date)
df['created_datetime']=df['Created Date'].apply(parse_date) #Applying it here right away!
In [4]:
df.index = df['created_datetime'] #Making the index the date so that the magic below can happen.
In [5]:
df.head(5) #It worked! :D
Out[5]:
In [6]:
df.columns #Loads of Columns, most of which won't be relevant anyways.
Out[6]:
What was the most popular type of complaint, and how many times was it filed?
In [7]:
df['Complaint Type'].value_counts().head(5) #Most popular complaints are Blocked Driveway, Illegal Parking and HOT WATER?!
Out[7]:
Make a horizontal bar graph of the top 5 most frequent complaint types.
In [46]:
df['Complaint Type'].value_counts().head(5).plot(kind='barh',title='Top 5 Frequent Complaints') # A Horizontal Bar Graph of Complaints.
Out[46]:
Which borough has the most complaints per capita? Since it's only 5 boroughs, you can do the math manually.
In [9]:
df['Borough'].value_counts() #Counting Complaints in the Boroughs. Using Population Data, We can find out how many complaints are there per capita.
Out[9]:
Populations of Manhattan, Brooklyn, Queens, Bronx and Staten Island in 2015 : 1,644,518, 2,636,735, 2,339,150, 1,455,444 , 474,558
In [74]:
57129/2636735
Out[74]:
In [75]:
46824/2339150
Out[75]:
In [76]:
42050/1644518 #Manhattanites complain the most per capita.
Out[76]:
In [77]:
29610/1455444
Out[77]:
In [78]:
7387/474558 #Staten Island the least.
Out[78]:
According to your selection of data, how many cases were filed in March? How about May?
In [10]:
len(df['2015-03']) #Cases filed in March.
Out[10]:
In [11]:
len(df['2015-05']) #Cases filed in May.
Out[11]:
I'd like to see all of the 311 complaints called in on April 1st.
Surprise! We couldn't do this in class, but it was just a limitation of our data set
In [12]:
df['2015-04-01'] #You would wish these were jokes. THEY ARENT!
Out[12]:
What was the most popular type of complaint on April 1st?
What were the most popular three types of complaint on April 1st
In [13]:
df['2015-04-01']['Complaint Type'].value_counts().head(3) #Illegal Parking is NOT A JOKE!
Out[13]:
In [ ]:
What month has the most reports filed? How many? Graph it.
In [47]:
df['Complaint Type'].resample('M').count().plot(figsize=(20,8),title='Complaints Per Month in 2015') #Using the resampler and tada!
Out[47]:
There seems to be a peak of complaints in May and October - at the beginning of summer and winter. Coincidence?!
What week of the year has the most reports filed? How many? Graph the weekly complaints.
In [48]:
df['Complaint Type'].resample('W').count().plot(figsize=(20,8),title='Complaints Per Week in 2015.')
Out[48]:
A little more nuanced than the earlier month graph.
Noise complaints are a big deal. Use .str.contains
to select noise complaints, and make an chart of when they show up annually. Then make a chart about when they show up every day (cyclic).
In [80]:
noise_df = df[df['Complaint Type'].str.contains("Noise")]
In [85]:
noise_df['created_datetime'].resample('M').count().plot()
Out[85]:
In [87]:
noise_df['created_datetime'].resample('H').count().plot()
Out[87]:
Which were the top five days of the year for filing complaints? How many on each of those days? Graph it.
In [49]:
list = df['Complaint Type'].resample('D').count()
list.sort_values(ascending=False).head().plot(kind='barh',title='Top Days to Complain in 2015.')
Out[49]:
In [ ]:
What hour of the day are the most complaints? Graph a day of complaints.
In [51]:
df['2015-10-28']['Complaint Type'].resample('H').count().plot(figsize=(20,8),title='What time of the day are most complaints being reported')
Out[51]:
In [ ]:
One of the hours has an odd number of complaints. What are the most common complaints at that hour, and what are the most common complaints the hour before and after?
In [52]:
df['2015-10-28 16']['Complaint Type'].value_counts().head(5) #Something seems wrong with 4 o clock that day.
Out[52]:
In [ ]:
In [ ]:
So odd. What's the per-minute breakdown of complaints between 12am and 1am? You don't need to include 1am.
In [53]:
df['2015-10-28 00']['Complaint Type'].resample('T').count().plot(figsize=(20,8),title='Per Minute Breakdown of Complaints between 12 AM and 1 AM')
Out[53]:
Looks like midnight is a little bit of an outlier. Why might that be? Take the 5 most common agencies and graph the times they file reports at (all day, not just midnight).
In [20]:
df['Agency Name'].value_counts().head(5) #Top 5 Agencies that handle complaints for 311.
Out[20]:
In [21]:
def parse_date_into_time(str_date): #So for us to be able to handle the next questions. I create a function that converts
#time into a HHMM format so that the graphs below can be smoother.
h=str(dateutil.parser.parse(str_date).hour)
m=str(dateutil.parser.parse(str_date).minute)
t=h+":"+m
return dateutil.parser.parse(t).time()
In [22]:
parse_date_into_time('07/06/2015 10:58:27 AM') #Seems to be working.
Out[22]:
In [23]:
df['Upload Time']=df['Created Date'].apply(parse_date_into_time) #create a new column called Upload Time.
In [24]:
df.head(5) #Working.
Out[24]:
In [25]:
nypddf=df[df['Agency Name']=='New York City Police Department']
In [54]:
nypddf['Upload Time'].value_counts().plot(figsize=(20,8),title='Time of the Day NYPD files its reports')
Out[54]:
In [27]:
dphpdf=df[df['Agency Name']=='Department of Housing Preservation and Development']
In [28]:
dphpdf['Upload Time'].value_counts().head(10) #ALL UPLOADED AT MIDNIGHT, by some unpaid intern.
Out[28]:
In [29]:
dotdf=df[df['Agency Name']=='Department of Transportation']
In [55]:
dotdf['Upload Time'].value_counts().plot(figsize=(20,8),title='Time of the Day DOT files its reports')
Out[55]:
It looks like they file reports just before lunch time and before they leave office?!
In [31]:
doprdf=df[df['Agency Name']=='Department of Parks and Recreation']
In [56]:
doprdf['Upload Time'].value_counts().plot(figsize=(20,8),title='Time of the Day DoP&R files its reports')
Out[56]:
They seem to do it throughout the day.
In [33]:
dohmhdf=df[df['Agency Name']=='Department of Health and Mental Hygiene']
In [57]:
dohmhdf['Upload Time'].value_counts().plot(figsize=(20,8),title='Time of the Day DoHMH files its reports')
Out[57]:
Only at Midnight, they upload their data.
In [35]:
df['Complaint Type'].resample('W').count().plot(figsize=(20,8))
Out[35]:
Graph those same agencies on an annual basis - make it weekly. When do people like to complain? When does the NYPD have an odd number of complaints?
In [58]:
nypddf['Complaint Type'].resample('W').count().plot(figsize=(20,8),title='NYPD: Complaints Per Week')
Out[58]:
In [59]:
dphpdf['Complaint Type'].resample('W').count().plot(figsize=(20,8),title='DPHP: Complaints Per Week')
Out[59]:
In [60]:
dotdf['Complaint Type'].resample('W').count().plot(figsize=(20,8),title='DOT: Complaints Per Week')
Out[60]:
In [61]:
doprdf['Complaint Type'].resample('W').count().plot(figsize=(20,8),title='DOPR: Complaints Per Week')
Out[61]:
In [62]:
dohmhdf['Complaint Type'].resample('W').count().plot(figsize=(20,8),title='DOHMD: Complaints Per Week')
Out[62]:
Maybe the NYPD deals with different issues at different times? Check the most popular complaints in July and August vs the month of May. Also check the most common complaints for the Housing Preservation Bureau (HPD) in winter vs. summer.
In [41]:
nypddf['2015-07']['Complaint Type'].value_counts().head() #Most Popular Complaints in July
Out[41]:
In [42]:
nypddf['2015-08']['Complaint Type'].value_counts().head()#Most Popular Complaints in August
Out[42]:
In [43]:
nypddf['2015-05']['Complaint Type'].value_counts().head() #Most Popular Complaints in May
Out[43]:
In [44]:
dphpdf['2015-06':'2015-08']['Complaint Type'].value_counts().head() #Most Popular Complaints in the Summer
Out[44]:
In [45]:
dphpdf['2015-11':'2016-01']['Complaint Type'].value_counts().head() #Most Popular Complaints in the Winter.
Out[45]:
In [ ]: